May 10, 2021

Overview of presentation

  1. Introduction to COVID-19 World Vaccine Adverse Reactions Dataset

  2. Project workflow

  3. Project methods

    3.1 Overview of important packages and verbs used

    3.2 Challenges and solutions - Load, Clean and Augment

  4. Visualizations

  5. Modeling

  6. Conclusion and discussion

Introduction

COVID-19 World Vaccine Adverse Reactions

Introduction: Dataset

COVID-19 World Vaccine Adverse Reactions

  • Data from the Vaccine Adverse Event Reporting System (VAERS) created by the Food and Drug Administration (FDA) and Centers for Disease Control and Prevention (CDC)
  • Contains 3 data sets:
    1. PATIENTS.CSV
    2. VACCINES.CSV
    3. SYMPTOMS.CSV
  • Data sets connected by patient IDs (VAERS_ID)

Introduction: Dataset

COVID-19 World Vaccine Adverse Reactions

PATIENTS.CSV: Contains information about the individuals who received the vaccines

## # A tibble: 34,121 x 35
##   VAERS_ID RECVDATE  STATE AGE_YRS CAGE_YR CAGE_MO SEX   RPT_DATE   SYMPTOM_TEXT
##   <chr>    <chr>     <chr>   <dbl>   <dbl>   <dbl> <chr> <date>     <chr>       
## 1 0916600  01/01/20… TX         33      33      NA F     NA         "Right side…
## 2 0916601  01/01/20… CA         73      73      NA F     NA         "Approximat…
## 3 0916602  01/01/20… WA         23      23      NA F     NA         "About 15 m…
## # … with 34,118 more rows, and 26 more variables: DIED <chr>, DATEDIED <chr>,
## #   L_THREAT <chr>, ER_VISIT <chr>, HOSPITAL <chr>, HOSPDAYS <dbl>,
## #   X_STAY <chr>, DISABLE <chr>, RECOVD <chr>, VAX_DATE <chr>,
## #   ONSET_DATE <chr>, NUMDAYS <dbl>, LAB_DATA <chr>, V_ADMINBY <chr>,
## #   V_FUNDBY <chr>, OTHER_MEDS <chr>, CUR_ILL <chr>, HISTORY <chr>,
## #   PRIOR_VAX <chr>, SPLTTYPE <chr>, FORM_VERS <dbl>, TODAYS_DATE <chr>,
## #   BIRTH_DEFECT <chr>, OFC_VISIT <chr>, ER_ED_VISIT <chr>, ALLERGIES <chr>

Introduction: Dataset

COVID-19 World Vaccine Adverse Reactions

VACCINES.CSV: Contains information about the received vaccine

## # A tibble: 34,630 x 8
##    VAERS_ID VAX_TYPE VAX_MANU         VAX_LOT VAX_DOSE_SERIES VAX_ROUTE VAX_SITE
##    <chr>    <chr>    <chr>            <chr>   <chr>           <chr>     <chr>   
##  1 0916600  COVID19  "MODERNA"        037K20A 1               IM        LA      
##  2 0916601  COVID19  "MODERNA"        025L20A 1               IM        RA      
##  3 0916602  COVID19  "PFIZER\\BIONTE… EL1284  1               IM        LA      
##  4 0916603  COVID19  "MODERNA"        unknown <NA>            <NA>      <NA>    
##  5 0916604  COVID19  "MODERNA"        <NA>    1               IM        LA      
##  6 0916606  COVID19  "MODERNA"        011J20A 1               IM        LA      
##  7 0916607  COVID19  "MODERNA"        <NA>    <NA>            IM        LA      
##  8 0916608  COVID19  "MODERNA"        <NA>    1               IM        LA      
##  9 0916609  COVID19  "MODERNA"        011J20… 1               IM        LA      
## 10 0916610  COVID19  "MODERNA"        <NA>    1               SYR       LA      
## # … with 34,620 more rows, and 1 more variable: VAX_NAME <chr>

Introduction: Dataset

COVID-19 World Vaccine Adverse Reactions

SYMPTOMS.CSV: Contains information about the symptoms experienced after vaccination

## # A tibble: 48,110 x 11
##   VAERS_ID SYMPTOM1     SYMPTOMVERSION1 SYMPTOM2     SYMPTOMVERSION2 SYMPTOM3   
##   <chr>    <chr>                  <dbl> <chr>                  <dbl> <chr>      
## 1 0916600  Dysphagia               23.1 Epiglottitis            23.1 <NA>       
## 2 0916601  Anxiety                 23.1 Dyspnoea                23.1 <NA>       
## 3 0916602  Chest disco…            23.1 Dysphagia               23.1 Pain in ex…
## 4 0916603  Dizziness               23.1 Fatigue                 23.1 Mobility d…
## 5 0916604  Injection s…            23.1 Injection s…            23.1 Injection …
## 6 0916606  Pharyngeal …            23.1 <NA>                    NA   <NA>       
## # … with 48,104 more rows, and 5 more variables: SYMPTOMVERSION3 <dbl>,
## #   SYMPTOM4 <chr>, SYMPTOMVERSION4 <dbl>, SYMPTOM5 <chr>,
## #   SYMPTOMVERSION5 <dbl>

Introduction: Aim

The aim of this project is to gain insight into the adverse effects of different COVID-19 vaccines and to answer the following questions:

  • Do some vaccines cause more/different symptoms than others?

  • Do patients with some profiles get more/different symptoms?

  • Are certain symptoms correlated with death?

  • Is patient profile correlated with death?

  • Does taking anti-inflammatories reduce the chance of having symptoms?

Methods

Methods: Project workflow

  1. Load data sets (patients, vaccines, symptoms)
  2. Clean each data set individually
  3. Augment and merge the data sets
  4. Make visualizations
  5. Do modeling

Methods: Important packages and verbs

Load and clean

  • readr: read_csv(), write_csv()
  • dplyr: filter(), select(), distinct(), mutate()
  • tidyr: replace_na()

Augment

  • dplyr: filter(), select(), mutate(), case_when(), arrange(), group_by(), count(), distinct(), summarise(), rename(), inner_join(), full_join()
  • tidyr: pivot_longer(), pivot_wider(), drop_na()
  • purrr: pluck()
  • stringr: regular expressions, str_c(), str_replace()

Analysis

  • ggplot2: geom_bar(), geom_boxplot(), geom_tile(), geom_segment(), theme_minimal()
  • forcats: fct_reorder()
  • scales
  • patchwork
  • viridis
  • stats: glm(), prcomp(), chisq.test()
  • broom: tidy(), glance()
  • purrr: map()
  • tidyr: nest()

Methods: Dataset loading

Challenges and solutions

Patients, vaccines and symptoms datasets:

  • Multiple large files → keep them compressed as gz-files and only decompress when reading into R
  • Wrong column types automatically assigned by R → manually assign appropriate column types
  • NA strings ("NA", "N/A", "Unknown", " ", …) → declare them as NA when loading the data
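
The three loading solutions above can be combined in one read_csv() call. The following is a minimal sketch, not the project's actual loading script: the file name patients_demo.csv.gz, its three columns and the list of NA strings are assumptions chosen for illustration, and a tiny stand-in file is written first so that the example runs on its own.

```r
library(readr)

# Write a tiny compressed stand-in file (hypothetical name and content);
# write_csv() compresses automatically when the path ends in .gz.
path <- file.path(tempdir(), "patients_demo.csv.gz")
write_csv(
  data.frame(
    VAERS_ID = c("0916600", "0916601", "0916602"),
    AGE_YRS  = c("33", "N/A", "23"),
    STATE    = c("TX", "Unknown", "WA")
  ),
  path
)

# read_csv() decompresses .gz files on the fly, so the file stays compressed on disk.
patients <- read_csv(
  path,
  # Assign column types manually instead of relying on guessing;
  # reading VAERS_ID as character keeps its leading zero.
  col_types = cols(
    VAERS_ID = col_character(),
    AGE_YRS  = col_double(),
    STATE    = col_character()
  ),
  # Strings that should be read as missing values (illustrative list).
  na = c("", " ", "NA", "N/A", "Unknown")
)
```

Declaring the NA strings at load time means that later cleaning steps only have to deal with real NA values rather than several spellings of "missing".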

Methods: Dataset cleaning

Challenges and solutions

Patients dataset:

  • Unwanted dirty/uninformative columns → select(-c(CAGE_YR, CAGE_MO, RPT_DATE … ))
  • NAs that should be interpreted as "no" → replace_na(list(ALLERGIES = "N"))
  • Row duplications → distinct()
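
The patients cleaning steps can be chained into one pipeline. This is a sketch on a made-up three-row tibble: the rows are invented, and only one of the dropped columns is included.

```r
library(dplyr)
library(tidyr)

# Invented toy data: the first two rows are exact duplicates.
patients <- tibble(
  VAERS_ID  = c("0916600", "0916600", "0916601"),
  CAGE_YR   = c(33, 33, 73),
  ALLERGIES = c(NA, NA, "Y")
)

patients_clean <- patients %>%
  select(-CAGE_YR) %>%                    # drop an uninformative column
  replace_na(list(ALLERGIES = "N")) %>%   # a missing allergy entry is read as "no"
  distinct()                              # remove duplicated rows
```

Note that replace_na() on a data frame takes a named list, one replacement value per column.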

Vaccines dataset:

  • Contains non-COVID19 vaccines → filter(VAX_TYPE == "COVID19")
  • Contains vaccines of unknown manufacturer → filter(VAX_MANU != "UNKNOWN MANUFACTURER")
  • Row duplications → distinct()
  • Duplicated IDs → add_count(VAERS_ID) %>% filter(n == 1) %>% select(-n)
  • Inconsistent naming of vaccines → rename()
  • Redundant and dirty columns → select(-c(VAX_NAME, VAX_LOT))
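
The filtering and de-duplication steps for the vaccines data can be sketched as follows. The toy rows (including the vaccine type "FLU3" and manufacturer "X") are invented for illustration. The add_count() step drops every VAERS_ID that still appears more than once after distinct(), that is, IDs with conflicting vaccine records.

```r
library(dplyr)

# Invented toy data covering each case handled below.
vaccines <- tibble(
  VAERS_ID = c("0916600", "0916601", "0916601", "0916602", "0916603", "0916604", "0916604"),
  VAX_TYPE = c("COVID19", "COVID19", "COVID19", "FLU3", "COVID19", "COVID19", "COVID19"),
  VAX_MANU = c("MODERNA", "MODERNA", "MODERNA", "X",
               "UNKNOWN MANUFACTURER", "MODERNA", "JANSSEN")
)

vaccines_clean <- vaccines %>%
  filter(VAX_TYPE == "COVID19") %>%               # keep COVID-19 vaccines only
  filter(VAX_MANU != "UNKNOWN MANUFACTURER") %>%  # drop unknown manufacturers
  distinct() %>%                                  # collapse exact duplicate rows
  add_count(VAERS_ID) %>%                         # n = rows per ID
  filter(n == 1) %>%                              # drop IDs with conflicting rows
  select(-n)
```

In this toy data, ID 0916601 survives because its two rows are identical, while ID 0916604 is removed because it is linked to two different manufacturers.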

Symptoms dataset:

  • SYMPTOMVERSION1-5 columns are unnecessary → select(-starts_with("SYMPTOMVERSION"))

Methods: Data augmentation

Challenges and solutions

Patients data set:

  • Columns containing long string descriptions → Make tidy categorical (Y/N) variables
## # A tibble: 3 x 3
##   VAERS_ID OTHER_MEDS                     TAKES_ANTIINFLAMATORY
##   <chr>    <chr>                          <chr>                
## 1 0916983  <NA>                           N                    
## 2 0916988  Ibuprofen  PM the night before Y                    
## 3 0916996  Clobetasol, Benadryl           N
  • Dirty, redundant and uninformative columns → select(-c(ALLERGIES, OTHER_MEDS … ))
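
One way to derive a Y/N column such as TAKES_ANTIINFLAMATORY from the free-text OTHER_MEDS column is pattern matching with stringr. The sketch below reuses the three rows shown above. The drug list in the regular expression is a short, hypothetical stand-in and not the pattern used in the project, and str_detect() is assumed here as the matching function.

```r
library(dplyr)
library(stringr)

patients <- tibble(
  VAERS_ID   = c("0916983", "0916988", "0916996"),
  OTHER_MEDS = c(NA, "Ibuprofen  PM the night before", "Clobetasol, Benadryl")
)

# Hypothetical, deliberately short list of anti-inflammatory drug names.
antiinflammatory_pattern <- regex("ibuprofen|aspirin|naproxen", ignore_case = TRUE)

patients <- patients %>%
  mutate(
    TAKES_ANTIINFLAMATORY = case_when(
      str_detect(OTHER_MEDS, antiinflammatory_pattern) ~ "Y",
      TRUE ~ "N"   # no match, or no medication recorded (NA)
    )
  )
```

Because case_when() treats an NA condition as not matched, patients with no recorded medication fall through to "N", as in the table above.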

Symptoms data set:

  • Too many distinct and inconsistently recorded symptoms → extract the 20 most frequent symptoms and turn them into tidy categorical (TRUE/FALSE) columns
  • Calculate total number of symptoms per patient → mutate() to add column (N_SYMPTOMS)
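
The symptom augmentation can be sketched as a reshape, a frequency count and a second reshape. This is one possible implementation under stated assumptions: the toy data is invented, n_top is set to 2 instead of 20 to keep the example small, and the helper names (symptoms_long, top_symptoms, n_symptoms, PRESENT) are hypothetical.

```r
library(dplyr)
library(tidyr)

# Invented toy data in the SYMPTOM1..SYMPTOM5 layout (three slots shown).
symptoms <- tibble(
  VAERS_ID = c("A", "B", "C", "D"),
  SYMPTOM1 = c("Headache", "Headache", "Headache", "Chills"),
  SYMPTOM2 = c("Fatigue", "Chills", "Fatigue", NA),
  SYMPTOM3 = c(NA, NA, "Dizziness", NA)
)

# One row per patient and symptom.
symptoms_long <- symptoms %>%
  pivot_longer(-VAERS_ID, names_to = "SLOT", values_to = "SYMPTOM") %>%
  drop_na(SYMPTOM) %>%
  distinct(VAERS_ID, SYMPTOM)

# Most frequent symptoms (the slides use the top 20).
n_top <- 2
top_symptoms <- symptoms_long %>%
  count(SYMPTOM, sort = TRUE) %>%
  slice_head(n = n_top) %>%
  pull(SYMPTOM)

# Total number of symptoms per patient.
n_symptoms <- symptoms_long %>%
  count(VAERS_ID, name = "N_SYMPTOMS")

# One TRUE/FALSE column per top symptom, plus N_SYMPTOMS.
symptoms_wide <- symptoms_long %>%
  filter(SYMPTOM %in% top_symptoms) %>%
  mutate(PRESENT = TRUE) %>%
  pivot_wider(names_from = SYMPTOM, values_from = PRESENT, values_fill = FALSE) %>%
  right_join(n_symptoms, by = "VAERS_ID") %>%   # bring back patients with no top symptom
  mutate(across(where(is.logical), ~ replace_na(.x, FALSE)))
```

The right_join() keeps patients whose symptoms are all outside the top list. Without it they would disappear from the wide table instead of getting FALSE in every symptom column.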

Methods: Data augmentation

Merging datasets

  • For visualization, we need the wide format → inner_join(by = "VAERS_ID")
  • For modeling, symptoms must be in long format → pivot_longer() to create:
    • SYMPTOM column: top 20 symptom names
    • SYMPTOM_VALUE column: TRUE/FALSE
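
The merge and the wide-to-long conversion can be sketched on toy versions of the three cleaned tables. The rows are invented, and only two symptom columns are shown instead of 20.

```r
library(dplyr)
library(tidyr)

# Invented toy versions of the three cleaned tables.
patients <- tibble(VAERS_ID = c("0916600", "0916601", "0916602"),
                   SEX = c("F", "F", "M"))
vaccines <- tibble(VAERS_ID = c("0916600", "0916601"),
                   VAX_MANU = c("MODERNA", "PFIZER-BIONTECH"))
symptoms <- tibble(VAERS_ID = c("0916600", "0916601"),
                   HEADACHE = c(TRUE, FALSE),
                   FATIGUE  = c(FALSE, TRUE))

# Wide format: one row per patient; inner_join() keeps only IDs present in all tables.
merged_wide <- patients %>%
  inner_join(vaccines, by = "VAERS_ID") %>%
  inner_join(symptoms, by = "VAERS_ID")

# Long format: one row per patient and symptom.
merged_long <- merged_wide %>%
  pivot_longer(
    c(HEADACHE, FATIGUE),
    names_to  = "SYMPTOM",
    values_to = "SYMPTOM_VALUE"
  )
```

Patient 0916602 has no vaccine or symptom record in the toy data, so the inner joins drop that row.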

Methods: Analysis

Exploratory data analysis

  • Visualizations with ggplot()
  • Reduction of dimensionality (Principal Component Analysis) with prcomp()

Modelling and statistics

  • Logistic regression models with glm()
  • Proportions tests with chisq.test()

04_analysis_visualizations

04_analysis_visualizations - Age, sex and manufacturer distribution

## # A tibble: 3 x 2
##   SEX       n
##   <chr> <int>
## 1 F     24070
## 2 M      8514
## 3 <NA>    828
## # A tibble: 3 x 2
##   VAX_MANU            n
##   <chr>           <int>
## 1 JANSSEN          1106
## 2 MODERNA         16253
## 3 PFIZER-BIONTECH 16053

04_analysis_visualizations - Days until onset of symptoms vs. Age Group

Hypothesis: two peaks corresponding to the innate and acquired immune response

04_analysis_visualizations - Age/sex vs. number of symptoms

04_analysis_visualizations - Vaccine manufacturer vs. number of symptoms

04_analysis_visualizations - Age vs. types of symptoms

04_analysis_visualizations - Sex vs. types of symptoms

04_analysis_visualizations - Vaccine manufacturer vs. types of symptoms

Modeling

Logistic regression: death ~ patient profile

## # A tibble: 7 x 6
##   term           estimate std.error statistic  p.value odds_ratio
##   <chr>             <dbl>     <dbl>     <dbl>    <dbl>      <dbl>
## 1 (Intercept)    -9.39      0.161    -58.2    0         0.0000832
## 2 SEXM            0.929     0.0573    16.2    4.00e-59  2.53     
## 3 AGE_YRS         0.0914    0.00207   44.1    0         1.10     
## 4 HAS_ALLERGIESY -0.0204    0.0605    -0.338  7.35e- 1  0.980    
## 5 HAS_ILLNESSY    1.08      0.0654    16.4    8.86e-61  2.93     
## 6 HAS_COVIDY     -0.113     0.142     -0.794  4.27e- 1  0.893    
## 7 HAD_COVIDY     -0.00375   0.195     -0.0193 9.85e- 1  0.996
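
A table of this shape can be produced with glm() and broom::tidy(). The odds_ratio column is the exponentiated estimate, for example exp(0.929) ≈ 2.53 for SEXM. The sketch below uses simulated data, not the VAERS data: the sample size and the coefficients used to generate DIED are arbitrary assumptions, and only two of the predictors are included.

```r
library(dplyr)
library(broom)

set.seed(1)
n_patients <- 500

# Simulated stand-in data (arbitrary generating coefficients).
patients <- tibble(
  SEX     = sample(c("F", "M"), n_patients, replace = TRUE),
  AGE_YRS = round(runif(n_patients, 18, 95))
) %>%
  mutate(
    p_death = plogis(-6 + 0.9 * (SEX == "M") + 0.07 * AGE_YRS),
    DIED    = rbinom(n_patients, 1, p_death)
  )

# Logistic regression of death on patient profile.
fit <- glm(DIED ~ SEX + AGE_YRS, data = patients, family = binomial())

# One row per term; exponentiate the log-odds estimate to get an odds ratio.
profile_model <- tidy(fit) %>%
  mutate(odds_ratio = exp(estimate))
```

An odds ratio above 1 means the term is associated with higher odds of the outcome, and a value below 1 with lower odds.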

Modeling

Logistic regression: death ~ symptoms

## # A tibble: 20 x 6
##   term          estimate std.error statistic  p.value odds_ratio
##   <chr>            <dbl>     <dbl>     <dbl>    <dbl>      <dbl>
## 1 (Intercept)     -2.01     0.0287    -70.1  0             0.134
## 2 HEADACHETRUE    -1.67     0.156     -10.7  7.92e-27      0.188
## 3 PYREXIATRUE     -0.429    0.112      -3.82 1.34e- 4      0.651
## 4 CHILLSTRUE      -1.21     0.171      -7.11 1.17e-12      0.298
## 5 FATIGUETRUE     -0.367    0.115      -3.19 1.41e- 3      0.693
## 6 PAINTRUE        -0.913    0.153      -5.98 2.17e- 9      0.401
## 7 NAUSEATRUE      -0.621    0.139      -4.46 8.17e- 6      0.538
## 8 DIZZINESSTRUE   -2.17     0.193     -11.2  2.87e-29      0.114
## # … with 12 more rows

Modeling

Many logistic regressions: each symptom ~ takes anti-inflammatory

## # A tibble: 20 x 9
##   SYMPTOM  estimate std.error statistic p.value conf.low conf.high odds_ratio
##   <chr>       <dbl>     <dbl>     <dbl>   <dbl>    <dbl>     <dbl>      <dbl>
## 1 HEADACHE  -0.164     0.0987    -1.67   0.0958   -0.362    0.0255      0.848
## 2 PYREXIA    0.0152    0.102      0.150  0.881    -0.189    0.211       1.02 
## 3 CHILLS    -0.121     0.109     -1.11   0.266    -0.340    0.0875      0.886
## 4 FATIGUE    0.0565    0.105      0.539  0.590    -0.154    0.258       1.06 
## 5 PAIN       0.0113    0.110      0.102  0.919    -0.210    0.222       1.01 
## # … with 15 more rows, and 1 more variable: identified_as <chr>
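
Fitting one model per symptom is commonly done with the nest() and map() pattern. The sketch below shows that pattern on simulated long-format data with two symptoms. The data is random, so the estimates carry no meaning, and the identified_as column from the table above is not reproduced.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)

set.seed(2)
n_patients <- 300

# Simulated long-format data: one row per patient and symptom.
long_data <- tibble(
  VAERS_ID = rep(seq_len(n_patients), times = 2),
  TAKES_ANTIINFLAMATORY = rep(sample(c("N", "Y"), n_patients, replace = TRUE), times = 2),
  SYMPTOM = rep(c("HEADACHE", "PYREXIA"), each = n_patients),
  SYMPTOM_VALUE = sample(c(TRUE, FALSE), 2 * n_patients, replace = TRUE)
)

symptom_models <- long_data %>%
  nest(data = -SYMPTOM) %>%                       # one nested data frame per symptom
  mutate(
    fit = map(data, ~ glm(SYMPTOM_VALUE ~ TAKES_ANTIINFLAMATORY,
                          data = .x, family = binomial())),
    tidied = map(fit, ~ tidy(.x, conf.int = TRUE))
  ) %>%
  unnest(tidied) %>%
  filter(term == "TAKES_ANTIINFLAMATORYY") %>%    # keep the slope, drop the intercept
  select(-data, -fit, -term) %>%
  mutate(odds_ratio = exp(estimate))
```

The term name TAKES_ANTIINFLAMATORYY is the variable name followed by the level "Y", which is how R labels the coefficient for a two-level character predictor.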

04_analysis_tests

Chi-squared contingency table tests

Death by vaccine manufacturer:

DIED   JANSSEN   MODERNA   PFIZER-BIONTECH
N         1090     15281             15212
Y           16       972               841

Death by sex:

DIED       F       M
N      23271    7523
Y        799     991
04_analysis_pca

04_analysis_pca - Important tools used

Important verbs and tools used:

  • prcomp()
  • augment()
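
A minimal sketch of how prcomp() and broom's augment() fit together is shown below. The input variables (AGE_YRS, N_SYMPTOMS, NUMDAYS) and the simulated values are assumptions for illustration. The variables that actually entered the project's PCA are not listed here.

```r
library(dplyr)
library(broom)

set.seed(3)
n_patients <- 100

# Simulated numeric input (hypothetical choice of variables).
pca_input <- tibble(
  AGE_YRS    = runif(n_patients, 18, 95),
  N_SYMPTOMS = rpois(n_patients, 4),
  NUMDAYS    = rexp(n_patients, 0.5)
)

# Centre and scale so that variables on different scales contribute comparably.
pca_fit <- prcomp(pca_input, center = TRUE, scale. = TRUE)

# augment() adds the scores (.fittedPC1, .fittedPC2, ...) to the original rows.
pca_scores <- augment(pca_fit, data = pca_input)

# Variance explained per component, the basis for a scree plot.
variance_explained <- tidy(pca_fit, matrix = "eigenvalues")

# Loadings of each variable on each component (the rotation matrix).
rotation <- tidy(pca_fit, matrix = "rotation")
```

The scores from augment() feed a biplot, the eigenvalue table feeds a scree plot, and the rotation table shows how much each variable contributes to each component.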

04_analysis_pca - PCA biplot

04_analysis_pca - Rotation matrix

04_analysis_pca - Scree plot

Conclusion and discussion

References